最近,我们看到了照片真实的人类建模和渲染的神经进展取得的巨大进展。但是,将它们集成到现有的下游应用程序中的现有网络管道中仍然具有挑战性。在本文中,我们提出了一种全面的神经方法,用于从密集的多视频视频中对人类表演进行高质量重建,压缩和渲染。我们的核心直觉是用一系列高效的神经技术桥接传统的动画网格工作流程。我们首先引入一个神经表面重建器,以在几分钟内进行高质量的表面产生。它与多分辨率哈希编码的截短签名距离场(TSDF)的隐式体积渲染相结合。我们进一步提出了一个混合神经跟踪器来生成动画网格,该网格将明确的非刚性跟踪与自我监督框架中的隐式动态变形结合在一起。前者将粗糙的翘曲返回到规范空间中,而后者隐含的一个隐含物进一步预测了使用4D哈希编码的位移,如我们的重建器中。然后,我们使用获得的动画网格讨论渲染方案,从动态纹理到各种带宽设置下的Lumigraph渲染。为了在质量和带宽之间取得复杂的平衡,我们通过首先渲染6个虚拟视图来涵盖表演者,然后进行闭塞感知的神经纹理融合,提出一个分层解决方案。我们证明了我们方法在各种平台上的各种基于网格的应用程序和照片真实的自由观看体验中的功效,即,通过移动AR插入虚拟人类的表演,或通过移动AR插入真实环境,或带有VR头戴式的人才表演。
translated by 谷歌翻译
最近的神经人类表示可以产生高质量的多视图渲染,但需要使用密集的多视图输入和昂贵的培训。因此,它们在很大程度上仅限于静态模型,因为每个帧都是不可行的。我们展示了人类学 - 一种普遍的神经表示 - 用于高保真自由观察动态人类的合成。类似于IBRNET如何通过避免每场景训练来帮助NERF,Humannerf跨多视图输入采用聚合像素对准特征,以及用于解决动态运动的姿势嵌入的非刚性变形场。原始人物员已经可以在稀疏视频输入的稀疏视频输入上产生合理的渲染。为了进一步提高渲染质量,我们使用外观混合模块增强了我们的解决方案,用于组合神经体积渲染和神经纹理混合的益处。各种多视图动态人类数据集的广泛实验证明了我们在挑战运动中合成照片 - 现实自由观点的方法和非常稀疏的相机视图输入中的普遍性和有效性。
translated by 谷歌翻译
最近,生成的数据无量子化作为一种​​实用的方法,将神经网络压缩到低位宽度而不访问真实数据。它通过利用其全精密对应物的批量归一化(BN)统计来生成数据来量化网络。然而,我们的研究表明,在实践中,BN统计的合成数据在分布和样品水平时严重均匀化,这导致量化网络的严重劣化。本文提出了一种通用不同的样本生成(DSG)方案,用于生成无数据的训练后量化和量化感知培训,以减轻有害的均质化。在我们的DSG中,我们首先将统计对齐缩写为BN层中的功能,以放宽分配约束。然后,我们加强特定BN层对不同样品的损失影响,并抑制了生成过程中样品之间的相关性,分别从统计和空间角度分别多样化样本。广泛的实验表明,对于大规模的图像分类任务,我们的DSG可以始终如一地优于各种神经结构上的现有数据无数据量化方法,尤其是在超低比特宽度下(例如,在W4A4设置下的22%的增益下)。此外,由我们的DSG引起的数据多样化引起了各种量化方法的一般增益,证明了多样性是无数据量化的高质量合成数据的重要特性。
translated by 谷歌翻译
Performance of spoken language understanding (SLU) can be degraded with automatic speech recognition (ASR) errors. We propose a novel approach to improve SLU robustness by randomly corrupting clean training text with an ASR error simulator, followed by self-correcting the errors and minimizing the target classification loss in a joint manner. In the proposed error simulator, we leverage confusion networks generated from an ASR decoder without human transcriptions to generate a variety of error patterns for model training. We evaluate our approach on the DSTC10 challenge targeted for knowledge-grounded task-oriented conversational dialogues with ASR errors. Experimental results show the effectiveness of our proposed approach, boosting the knowledge-seeking turn detection (KTD) F1 significantly from 0.9433 to 0.9904. Knowledge cluster classification is boosted from 0.7924 to 0.9333 in Recall@1. After knowledge document re-ranking, our approach shows significant improvement in all knowledge selection metrics, from 0.7358 to 0.7806 in Recall@1, from 0.8301 to 0.9333 in Recall@5, and from 0.7798 to 0.8460 in MRR@5 on the test set. In the recent DSTC10 evaluation, our approach demonstrates significant improvement in knowledge selection, boosting Recall@1 from 0.495 to 0.7144 compared to the official baseline. Our source code is released in GitHub https://github.com/yctam/dstc10_track2_task2.git.
translated by 谷歌翻译
数十亿人每天都在社交媒体上分享他们的日常生活图像。但是,它们的生物识别信息(例如,指纹)可以很容易地从这些图像中偷走。从社交媒体上泄漏的指纹泄漏的威胁引起了人们对匿名分享图像的强烈渴望,同时保持图像质量,因为指纹充当了终生的个体生物识别密码。为了防止指纹泄漏,通过在图像上添加不可察觉的扰动来作为解决方案出现。但是,现有作品要么在黑盒可传输性方面弱,要么显得不自然。由视觉感知层次结构激励(即,高级感知利用模型共享的语义,这些语义在模型中很好地转移,而低水平的感知提取物则是原始刺激的,并且会引起高视觉敏感性的刺激),我们提出了一个层次的感知噪声,注射框架以解决上述问题。对于黑盒可传递性,我们在指纹方向场上注入保护性噪声,以扰动模型共享的高级语义(即指纹脊)。考虑到视觉自然性,我们通过正规化侧向基因核的响应来抑制低级局部对比度刺激。我们的Fingersafe是第一个在数字(最高94.12%)和现实的场景(Twitter和Facebook,高达68.75%)中提供可行的指纹保护的人。我们的代码可以在https://github.com/nlsde-safety-team/fingersafe上找到。
translated by 谷歌翻译
人群计数已被广泛用于估计安全至关重要的场景中的人数,被证明很容易受到物理世界中对抗性例子的影响(例如,对抗性斑块)。尽管有害,但对抗性例子也很有价值,对于评估和更好地理解模型的鲁棒性也很有价值。但是,现有的对抗人群计算的对抗性示例生成方法在不同的黑盒模型之间缺乏强大的可传递性,这限制了它们对现实世界系统的实用性。本文提出了与模型不变特征正相关的事实,本文提出了感知的对抗贴片(PAP)生成框架,以使用模型共享的感知功能来定制对对抗性的扰动。具体来说,我们将一种自适应人群密度加权方法手工制作,以捕获各种模型中不变的量表感知特征,并利用密度引导的注意力来捕获模型共享的位置感知。证明它们都可以提高我们对抗斑块的攻击性转移性。广泛的实验表明,我们的PAP可以在数字世界和物理世界中实现最先进的进攻性能,并且以大幅度的优于以前的提案(最多+685.7 MAE和+699.5 MSE)。此外,我们从经验上证明,对我们的PAP进行的对抗训练可以使香草模型的性能受益,以减轻人群计数的几个实际挑战,包括跨数据集的概括(高达-376.0 MAE和-376.0 MAE和-354.9 MSE)和对复杂背景的鲁棒性(上升)至-10.3 MAE和-16.4 MSE)。
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortion and image content variation, which complicate the distortion patterns crossing different scales and aggravate the difficulty of the regression problem for BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies to make the regression model produce better performance. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression issue to align with the law of human learning process from easy to hard. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets, and the experimental results indicate that the performance of PMT-IQA is superior to the comparison approaches, and both MS and PMT modules improve the model's performance.
translated by 谷歌翻译